We propose a novel teacher-student model for semi-supervised multi-organ segmentation. In teacher-student models, data augmentation is usually applied to unlabeled data to regularize consistency training between teacher and student. We start from the key observation that the fixed relative locations and variable sizes of different organs provide distribution information about how a multi-organ CT scan is composed. We therefore treat this anatomical prior as a strong tool for guiding data augmentation and reducing the mismatch between labeled and unlabeled images in semi-supervised learning. More specifically, we propose a data augmentation strategy based on partition-and-recovery of $N^3$ cubes, applied both across and within labeled and unlabeled images. The cross-branch strategy encourages unlabeled images to learn organ semantics at their relative locations from labeled images, while the within-branch strategy enhances the ability to learn small organs. For the within-branch, we further propose to refine the quality of pseudo labels by blending the representations learned from small cubes to incorporate local attributes. Our method is termed MagicNet, since it treats the CT volume as a magic cube whose $N^3$-cube partition-and-recovery process matches the way a magic cube is played. Extensive experiments on two public CT multi-organ datasets demonstrate the effectiveness of MagicNet, which noticeably outperforms state-of-the-art semi-supervised medical image segmentation approaches, with a +7% DSC improvement on the MACT dataset with 10% labeled images.
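To make the partition-and-recovery augmentation concrete, below is a minimal NumPy sketch of how $N^3$ cube mixing across a labeled and an unlabeled scan could look; the function names and the random swap ratio are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of N^3 "magic-cube" partition-and-recovery augmentation.
# partition_cubes / recover_volume / cross_mix are hypothetical helper names.
import numpy as np

def partition_cubes(volume, n=3):
    """Split a (D, H, W) volume into n**3 equally sized small cubes."""
    d, h, w = volume.shape
    assert d % n == 0 and h % n == 0 and w % n == 0
    sd, sh, sw = d // n, h // n, w // n
    cubes = []
    for i in range(n):
        for j in range(n):
            for k in range(n):
                cubes.append(volume[i*sd:(i+1)*sd, j*sh:(j+1)*sh, k*sw:(k+1)*sw].copy())
    return cubes

def recover_volume(cubes, n=3):
    """Reassemble the n**3 cubes into a full volume (inverse of partition_cubes)."""
    sd, sh, sw = cubes[0].shape
    volume = np.empty((n*sd, n*sh, n*sw), dtype=cubes[0].dtype)
    idx = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                volume[i*sd:(i+1)*sd, j*sh:(j+1)*sh, k*sw:(k+1)*sw] = cubes[idx]
                idx += 1
    return volume

def cross_mix(labeled_vol, unlabeled_vol, n=3, swap_ratio=0.5):
    """Swap a random subset of location-aligned cubes between a labeled and an
    unlabeled scan, so the unlabeled branch sees organ content at its expected
    relative location (cross-branch mixing)."""
    lab_cubes = partition_cubes(labeled_vol, n)
    unl_cubes = partition_cubes(unlabeled_vol, n)
    swap = np.random.rand(n ** 3) < swap_ratio
    for idx in np.where(swap)[0]:
        lab_cubes[idx], unl_cubes[idx] = unl_cubes[idx], lab_cubes[idx]
    return recover_volume(lab_cubes, n), recover_volume(unl_cubes, n)
```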
Pre-trained vision-language models like CLIP have recently shown superior performance on various downstream tasks, including image classification and segmentation. However, in fine-grained image re-identification (ReID) the labels are indexes and lack concrete text descriptions, so it remains to be determined how such models can be applied to these tasks. This paper first finds that simply fine-tuning the visual model initialized by the image encoder in CLIP already obtains competitive performance on various ReID tasks. We then propose a two-stage strategy to facilitate a better visual representation. The key idea is to fully exploit the cross-modal description ability of CLIP through a set of learnable text tokens for each ID, which are fed to the text encoder to form ambiguous descriptions. In the first training stage, the image and text encoders from CLIP are kept fixed, and only the text tokens are optimized from scratch by a contrastive loss computed within a batch. In the second stage, the ID-specific text tokens and their encoder become static, providing constraints for fine-tuning the image encoder. With the help of the loss designed for the downstream task, the image encoder is able to accurately represent data as vectors in the feature embedding. The effectiveness of the proposed strategy is validated on several person and vehicle ReID datasets. Code is available at https://github.com/Syliz517/CLIP-ReID.
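A hedged sketch of the first stage follows: ID-specific learnable text tokens optimized against a frozen CLIP image/text encoder pair. The prompt layout and the soft-target contrastive loss are assumptions for illustration, not the released CLIP-ReID code.

```python
# Stage one sketch: learn per-ID text tokens with frozen CLIP encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IDPrompts(nn.Module):
    def __init__(self, num_ids, num_tokens=4, dim=512):
        super().__init__()
        # One set of learnable context tokens per identity,
        # e.g. "A photo of a [X]_1 ... [X]_M person."
        self.tokens = nn.Parameter(torch.randn(num_ids, num_tokens, dim) * 0.02)

    def forward(self, id_labels):
        return self.tokens[id_labels]  # (B, M, dim), fed to the frozen text encoder

def stage_one_loss(image_feats, text_feats, id_labels, temperature=0.07):
    """Symmetric image-to-text / text-to-image contrastive loss within a batch,
    with soft targets so multiple images of the same ID count as positives."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature
    targets = (id_labels.unsqueeze(1) == id_labels.unsqueeze(0)).float()
    targets = targets / targets.sum(dim=1, keepdim=True)
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets.t() * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```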
Without densely tiled anchor boxes or grid points in the image, Sparse R-CNN achieves promising results through a set of object queries and proposal boxes that are updated in a cascaded training manner. However, due to its sparse nature and the one-to-one relation between each object query and its attended region, it depends heavily on self-attention, which is usually inaccurate in the early training stage. Moreover, in scenes of dense objects, an object query interacts with many irrelevant ones, reducing its uniqueness and harming performance. This paper proposes to use the IoU between different boxes as a prior for value routing in self-attention. The original attention matrix is multiplied by a same-sized matrix computed from the IoU of the proposal boxes, and together they determine the routing scheme so that irrelevant features can be suppressed. Furthermore, to accurately extract features for both classification and regression, we add two lightweight projection heads that provide dynamic channel masks conditioned on the object query; these are multiplied with the output of the dynamic convs, making the results suitable for the two different tasks. We validate the proposed scheme on different datasets including MS-COCO and CrowdHuman, showing that it significantly improves performance and speeds up model convergence.
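As a minimal sketch of the IoU prior described above, the query-query attention matrix can be modulated elementwise by the pairwise IoU of the proposal boxes so that routes between unrelated boxes are suppressed; the code below is an illustrative restatement, not the paper's exact implementation.

```python
# IoU-guided value routing for self-attention among object queries.
import torch

def pairwise_iou(boxes):
    """boxes: (N, 4) in (x1, y1, x2, y2); returns an (N, N) IoU matrix."""
    area = (boxes[:, 2] - boxes[:, 0]).clamp(min=0) * (boxes[:, 3] - boxes[:, 1]).clamp(min=0)
    lt = torch.max(boxes[:, None, :2], boxes[None, :, :2])   # (N, N, 2)
    rb = torch.min(boxes[:, None, 2:], boxes[None, :, 2:])   # (N, N, 2)
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    union = area[:, None] + area[None, :] - inter
    return inter / union.clamp(min=1e-6)

def iou_routed_attention(q, k, v, boxes):
    """Self-attention over N object queries with an IoU prior on value routing."""
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)  # (N, N)
    attn = attn * pairwise_iou(boxes)                 # suppress unrelated boxes
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # re-normalize rows
    return attn @ v
```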
Benefiting from the rich and detailed spectral information in hyperspectral images (HSI), HSI offers great potential for a wide variety of medical applications such as computational pathology. However, the lack of adequate annotated data and the high spatio-spectral dimensionality of HSIs usually make classification networks prone to overfitting. Thus, it is essential to learn general representations that can be transferred to downstream tasks. To our knowledge, no appropriate self-supervised pre-training method has been designed for histopathology HSIs. In this paper, we introduce an efficient and effective Self-supervised Spectral Regression (S$^3$R) method, which exploits the low-rank characteristic of the spectral domain of HSI. More concretely, we propose to learn a set of linear coefficients that represent one band by the remaining bands after masking that band out; the band is then restored by reweighting the remaining bands with the learned coefficients. Two pretext tasks are designed: (1) S$^3$R-CR, which regresses the linear coefficients so that the pre-trained model understands the inherent structures of HSIs and the pathological characteristics of different morphologies; (2) S$^3$R-BR, which regresses the missing band, making the model learn the holistic semantics of HSIs. Compared with prior art, i.e., contrastive learning methods that focus on natural images, S$^3$R converges at least 3 times faster and achieves up to 14% higher accuracy when transferred to HSI classification tasks.
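Below is a hedged sketch of the masked-band spectral regression pretext task: one spectral band is masked and regressed as a learned linear combination of the remaining bands. The module name and loss are illustrative assumptions, not the paper's code.

```python
# Masked spectral-band regression pretext task (illustrative sketch).
import torch
import torch.nn as nn

class MaskedBandRegression(nn.Module):
    def __init__(self, num_bands):
        super().__init__()
        self.num_bands = num_bands
        # One coefficient per (masked band, remaining band) pair.
        self.coeffs = nn.Parameter(torch.zeros(num_bands, num_bands))

    def forward(self, hsi, band_idx):
        """hsi: (B, C, H, W) hyperspectral cube; band_idx: index of the masked band."""
        target = hsi[:, band_idx]                                  # (B, H, W)
        keep = [b for b in range(self.num_bands) if b != band_idx]
        remaining = hsi[:, keep]                                   # (B, C-1, H, W)
        w = self.coeffs[band_idx, keep]                            # (C-1,)
        pred = (remaining * w.view(1, -1, 1, 1)).sum(dim=1)        # linear recombination
        return torch.mean((pred - target) ** 2)                    # reconstruction loss
```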
Controllable person image synthesis enables a wide range of applications through explicit control over body pose and appearance. In this paper, we propose a cross-attention based style distribution module that is computed between the source semantic styles and the target pose for pose transfer. The module deliberately selects the style represented by each semantic and distributes them according to the target pose. The attention matrix of the cross-attention expresses the dynamic similarities between the target pose and the source styles of all semantics. Therefore, it can be utilized to route the color and texture from the source image, and it is further constrained by the target parsing map to achieve a clearer objective. Meanwhile, to encode the source appearance accurately, self-attention among the different semantic styles is also added. The effectiveness of our model is validated quantitatively and qualitatively on pose transfer and virtual try-on tasks.
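A minimal sketch of the style distribution step follows: target-pose features act as queries and per-semantic source styles as keys/values, so each target location receives the style of the matching semantic region. The shapes and scaling are assumptions for illustration.

```python
# Cross-attention style distribution: route source styles to target-pose locations.
import torch

def distribute_styles(pose_feat, semantic_styles):
    """
    pose_feat:       (B, N, D)  target-pose features (queries), N = H*W locations
    semantic_styles: (B, S, D)  per-semantic source styles (keys and values)
    returns:         (B, N, D)  styles routed to each target location
    """
    d = pose_feat.shape[-1]
    attn = torch.softmax(pose_feat @ semantic_styles.transpose(1, 2) / d ** 0.5, dim=-1)  # (B, N, S)
    return attn @ semantic_styles
```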
It is a common practice to adopt the ResBlock, which learns the difference between blurry and sharp image pairs, in end-to-end image deblurring architectures. Reconstructing a sharp image from its blurry counterpart requires changes in both low- and high-frequency information. Although the conventional ResBlock may be good at capturing the high-frequency components of images, it tends to overlook low-frequency information. Moreover, the ResBlock usually fails to felicitously model the long-distance information that is non-trivial for reconstructing a sharp image from its blurry counterpart. In this paper, we present a Residual Fast Fourier Transform with Convolution Block (Res FFT-Conv Block), capable of capturing both long-term and short-term interactions while integrating both low- and high-frequency residual information. The Res FFT-Conv Block is a conceptually simple yet computationally efficient, plug-and-play block that leads to remarkable performance gains in different architectures. With the Res FFT-Conv Block, we further propose a Deep Residual Fourier Transformation (DeepRFT) framework based on MIMO-UNet, achieving state-of-the-art image deblurring performance on the GoPro, HIDE, RealBlur and DPDD datasets. Experiments show that our DeepRFT boosts image deblurring performance significantly (e.g., a 1.09 dB improvement in PSNR on the GoPro dataset compared with MIMO-UNet), and DeepRFT+ reaches 33.23 dB PSNR on the GoPro dataset.
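A hedged sketch of a residual FFT + convolution block is given below: a spatial conv branch captures high-frequency residuals while an rFFT branch (1x1 convs on the real/imaginary parts) models global, low-frequency structure, and both are summed with the identity shortcut. Channel sizes and layer counts are illustrative, not the paper's exact configuration.

```python
# Illustrative Res FFT-Conv block: spatial conv branch + frequency-domain branch.
import torch
import torch.nn as nn

class ResFFTConvBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Operates on concatenated real/imaginary parts of the spectrum.
        self.freq = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels, 2 * channels, 1),
        )

    def forward(self, x):
        # Frequency branch: rFFT -> 1x1 convs -> inverse rFFT back to image space.
        f = torch.fft.rfft2(x, norm="ortho")
        f = torch.cat([f.real, f.imag], dim=1)
        f = self.freq(f)
        real, imag = torch.chunk(f, 2, dim=1)
        f = torch.fft.irfft2(torch.complex(real, imag), s=x.shape[-2:], norm="ortho")
        return x + self.spatial(x) + f
```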
To improve the accuracy of Chinese text classification models under low hardware requirements, this paper designs an improved stacking-based model that combines five different sub-models, including TextCNN, LSTM and Bi-LSTM. Compared with existing ensemble learning methods, this model achieves higher accuracy on text classification tasks, while its hardware requirements are far lower than those of BERT-based models.
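To illustrate the stacking structure only, here is a minimal sketch in which several base classifiers (e.g. TextCNN, LSTM, Bi-LSTM variants) each produce class-probability vectors that are concatenated and fed to a light meta-classifier; the base models are stand-ins and the meta-classifier choice is an assumption.

```python
# Minimal stacking sketch: concatenate base-model probabilities, fit a meta-classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def stack_predictions(base_probs):
    """base_probs: list of (N, num_classes) arrays, one per sub-model."""
    return np.concatenate(base_probs, axis=1)  # (N, num_models * num_classes)

def fit_meta_classifier(base_probs_train, y_train):
    meta = LogisticRegression(max_iter=1000)
    meta.fit(stack_predictions(base_probs_train), y_train)
    return meta

def predict(meta, base_probs_test):
    return meta.predict(stack_predictions(base_probs_test))
```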
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT shows strong robustness even when the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
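A hedged sketch of the implicit spatial alignment idea: instead of an explicit view transform, each modality's tokens are tagged with a position embedding derived from associated 3D coordinates, so a plain transformer decoder can attend across modalities. The MLP encoder below is an illustrative stand-in, not CMT's actual position-encoding design.

```python
# Illustrative 3D-coordinate position encoding for multi-modal tokens.
import torch
import torch.nn as nn

class Coords3DPositionEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, embed_dim), nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, tokens, coords_3d):
        """tokens: (B, N, D) modality tokens; coords_3d: (B, N, 3) associated 3D points."""
        return tokens + self.mlp(coords_3d)
```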
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
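As a minimal sketch of the NAIVEATTACK idea only: stamp a small trigger patch on a fraction of the original training images (relabeled to the attacker's target class) before running any off-the-shelf dataset distillation routine, so the backdoor is baked into the synthetic set. The patch size, position, and poison ratio are illustrative choices, not the paper's exact settings.

```python
# Poison the raw dataset with a corner trigger before distillation (illustrative).
import numpy as np

def add_trigger(image, patch_size=3, value=1.0):
    """image: (H, W, C) array in [0, 1]; stamp a white patch in the bottom-right corner."""
    poisoned = image.copy()
    poisoned[-patch_size:, -patch_size:, :] = value
    return poisoned

def poison_dataset(images, labels, target_class, poison_ratio=0.01, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), size=int(poison_ratio * len(images)), replace=False)
    images, labels = images.copy(), labels.copy()
    for i in idx:
        images[i] = add_trigger(images[i])
        labels[i] = target_class
    # The poisoned set is then fed to any dataset distillation method as usual.
    return images, labels
```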
Automatic music generation with artificial intelligence typically requires a large amount of data, which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by fine-tuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while models that are not pre-trained (Transformer) show no such ability beyond naive repetition. Evaluating generated music is a challenging task, and evaluating drum grooves, for which there is little precedent in the literature, even more so. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared with those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
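For orientation only, here is a hedged sketch of the language-to-music transfer setup: drum grooves are serialized into text-like token sequences and a language model pre-trained on text is fine-tuned on them with the ordinary causal-LM objective. GPT-2 is used here as an openly available stand-in for GPT3, and the groove encoding shown is a made-up illustrative format, not the paper's representation.

```python
# Fine-tune a text-pretrained language model on serialized drum grooves (sketch).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A drum bar rendered as text: instrument name and onset step per 16th note.
groove_text = "kick 0 hihat 0 hihat 2 snare 4 hihat 4 hihat 6 kick 8 hihat 8"

inputs = tokenizer(groove_text, return_tensors="pt")
labels = inputs["input_ids"].clone()
outputs = model(**inputs, labels=labels)  # standard causal-LM fine-tuning loss
outputs.loss.backward()                   # one step; wrap in an optimizer loop in practice
```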